
HL Paper 3
This question explores methods to determine the area bounded by an unknown curve.
The curve y=f(x) is shown in the graph, for 0⩽x⩽4.4.
The curve passes through the following points.
It is required to find the area bounded by the curve, the x-axis, the y-axis and the line x = 4.4.
One possible model for the curve is a cubic function.
A second possible model for the curve is an exponential function, , where .
Use the trapezoidal rule to find an estimate for the area.
With reference to the shape of the graph, explain whether your answer to part (a)(i) will be an overestimate or an underestimate of the area.
Use all the coordinates in the table to find the equation of the least squares cubic regression curve.
Write down the coefficient of determination.
Write down an expression for the area enclosed by the cubic function, the x-axis, the y-axis and the line x = 4.4.
Find the value of this area.
Show that .
Hence explain how a straight line graph could be drawn using the coordinates in the table.
By finding the equation of a suitable regression line, show that and .
Hence find the area enclosed by the exponential function, the x-axis, the y-axis and the line x = 4.4.
Markscheme
Area M1A1
Area = 156 units² A1
[3 marks]
The graph is concave up, R1
so the trapezoidal rule will give an overestimate. A1
[2 marks]
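The trapezoidal estimate and the concavity argument can be illustrated with a minimal Python sketch. The table of coordinates did not survive in this copy, so the function name `trapezoid_area` and the sample points (taken from the hypothetical concave-up curve y = x² + 1) are assumptions for illustration only.

```python
def trapezoid_area(xs, ys):
    """Trapezoidal estimate of the area under the points (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    total = 0.0
    for i in range(len(xs) - 1):
        width = xs[i + 1] - xs[i]
        # Area of one trapezium: width times the mean of the two heights.
        total += 0.5 * width * (ys[i] + ys[i + 1])
    return total


# Hypothetical concave-up curve y = x^2 + 1 sampled at x = 0, 1, 2, 3, 4.
xs = [0, 1, 2, 3, 4]
ys = [x**2 + 1 for x in xs]

estimate = trapezoid_area(xs, ys)
exact = 4**3 / 3 + 4  # integral of x^2 + 1 from 0 to 4

print(estimate, exact)
```

For this hypothetical curve the estimate exceeds the exact area, because each chord of a concave-up curve lies above the curve.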
M1A2
[3 marks]
A1
[1 mark]
Area A1A1
[2 marks]
Area = 145 units² (Condone 143–145 units², using rounded values.) A2
[2 marks]
M1
A1
AG
[2 marks]
Plot against . R1
[1 mark]
Regression line is M1A1
So gradient = 0.986 R1
M1A1
[5 marks]
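The linearisation used in these parts can be sketched in Python. The sketch assumes the exponential model has the form y = p·e^(qx), so that ln y = qx + ln p is a straight line with gradient q and intercept ln p; the model formula is missing from this copy, so that form, the helper names and the data below are all assumptions for illustration.

```python
import math


def linear_fit(xs, ys):
    """Least squares line: ys is approximated by gradient * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def fit_exponential(xs, ys):
    """Fit y = p * exp(q * x) by regressing ln(y) on x (assumed model form)."""
    q, ln_p = linear_fit(xs, [math.log(y) for y in ys])
    return math.exp(ln_p), q


def area_under_exponential(p, q, upper):
    """Exact integral of p * exp(q * x) from 0 to upper, for q != 0."""
    return p / q * (math.exp(q * upper) - 1.0)


# Hypothetical data generated exactly from y = 2 * exp(0.5 * x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

p, q = fit_exponential(xs, ys)
print(p, q, area_under_exponential(p, q, 4.0))
```

Regressing ln y on x rather than fitting the exponential directly keeps the calculation a plain linear regression, which matches the "plot ... against ..." step in the markscheme.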
Area units² M1A1
[2 marks]
Examiners report
In this question you will explore possible models for the spread of an infectious disease.
An infectious disease has begun spreading in a country. The National Disease Control Centre (NDCC) has compiled the following data after receiving alerts from hospitals.
A graph of against is shown below.
The NDCC want to find a model to predict the total number of people infected, so they can plan for medicine and hospital facilities. After looking at the data, they think an exponential function in the form could be used as a model.
Use your answer to part (a) to predict
The NDCC want to verify the accuracy of these predictions. They decide to perform a goodness of fit test.
The predictions given by the model for the first five days are shown in the table.
In fact, the first day when the total number of people infected is greater than 1000 is day 14, when a total of 1015 people are infected.
Based on this new data, the NDCC decide to try a logistic model in the form .
Use the data from days 1–5, together with day 14, to find the value of
Use an exponential regression to find the value of and of , correct to 4 decimal places.
the number of new people infected on day 6.
the day when the total number of people infected will be greater than 1000.
Use your answer to part (a) to show that the model predicts 16.7 people will be infected on the first day.
Explain why the number of degrees of freedom is 2.
Perform a goodness of fit test at the 5% significance level. You should clearly state your hypotheses, the p-value, and your conclusion.
Give two reasons why the prediction in part (b)(ii) might be lower than 14.
.
.
.
Hence predict the total number of people infected by this disease after several months.
Use the logistic model to find the day when the rate of increase of people infected is greatest.
Markscheme
M1A1A1
[3 marks]
A1
number of new people infected = 247 – 140 = 107 M1A1
[3 marks]
use of graph or table M1
day 9 A1
[2 marks]
9.7782(1.7125)¹ M1
= 16.7 people AG
[1 mark]
2 parameters ( and ) were estimated from the data. R1
M1
= 2 AG
[2 marks]
data is modeled by and data is not modeled by A1
p-value = 0.893 A2
Since 0.893 > 0.05 R1
Insufficient evidence to reject . So data is modeled by A1
[5 marks]
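The degrees-of-freedom count and the test itself can be sketched in Python. The expected values use the model 9.7782(1.7125)^t from the markscheme; the observed counts below are hypothetical, because the NDCC table is missing from this copy. With exactly 2 degrees of freedom the upper-tail probability of the chi-squared distribution has the closed form e^(−x/2), which avoids any external library.

```python
import math


def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_two_df(statistic):
    """Upper-tail probability for chi-squared with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Expected totals for days 1 to 5 from the fitted exponential model.
expected = [9.7782 * 1.7125 ** day for day in range(1, 6)]

# Hypothetical observed totals, NOT the data from the question.
observed = [18, 27, 51, 82, 146]

# Five categories, minus 1, minus the 2 parameters estimated from the data.
degrees_of_freedom = len(observed) - 1 - 2

statistic = chi_squared_statistic(observed, expected)
p_value = p_value_two_df(statistic)
print(degrees_of_freedom, statistic, p_value)
```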
vaccine or medicine might slow down rate of infection R1
People become more aware of disease and take precautions to avoid infection R1
Accept other valid reasons
[2 marks]
1060 M1A1
[2 marks]
108 A1
[1 mark]
0.560 A1
[1 mark]
As M1
A1
[2 marks]
sketch of or solve M1
A1
Day 8 A1
[3 marks]
Examiners report
A smartphone’s battery life is defined as the number of hours a fully charged battery can be used before the smartphone stops working. A company claims that the battery life of a model of smartphone is, on average, 9.5 hours. To test this claim, an experiment is conducted on a random sample of 20 smartphones of this model. For each smartphone, the battery life, hours, is measured and the sample mean, , calculated. It can be assumed the battery lives are normally distributed with standard deviation 0.4 hours.
It is then found that this model of smartphone has an average battery life of 9.8 hours.
State suitable hypotheses for a two-tailed test.
Find the critical region for testing at the 5% significance level.
Find the probability of making a Type II error.
Another model of smartphone whose battery life may be assumed to be normally distributed with mean μ hours and standard deviation 1.2 hours is tested. A researcher measures the battery life of six of these smartphones and calculates a confidence interval of [10.2, 11.4] for μ.
Calculate the confidence level of this interval.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
Note: In question 3, accept answers that round correctly to 2 significant figures.
A1
[1 mark]
the critical values are (M1)(A1)
i.e. 9.3247…, 9.6753…
the critical region is < 9.32, > 9.68 A1A1
Note: Award A1 for correct inequalities, A1 for correct values.
Note: Award M0 if t-distribution used; note that the 97.5th percentile of t(19) is 2.093…
[4 marks]
(A1)
(M1)
=0.0816 A1
Note: FT the critical values from (b). Note that critical values of 9.32 and 9.68 give 0.0899.
[3 marks]
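The critical values and the Type II error probability can be checked with a short Python sketch using the figures given in the question (claimed mean 9.5, standard deviation 0.4, sample size 20, true mean 9.8). The variable names are illustrative assumptions; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist

mu_0 = 9.5       # mean claimed by the company (hours)
sigma = 0.4      # known population standard deviation (hours)
n = 20           # sample size
alpha = 0.05     # significance level, two-tailed
mu_true = 9.8    # actual mean found later

standard_error = sigma / n ** 0.5

# Sampling distribution of the sample mean under the null hypothesis.
under_h0 = NormalDist(mu_0, standard_error)
lower = under_h0.inv_cdf(alpha / 2)
upper = under_h0.inv_cdf(1 - alpha / 2)

# Type II error: the sample mean falls in the acceptance region
# even though the true mean is 9.8.
under_true = NormalDist(mu_true, standard_error)
type_ii = under_true.cdf(upper) - under_true.cdf(lower)

print(round(lower, 4), round(upper, 4), round(type_ii, 4))
```

The unrounded critical values are used for the Type II calculation, which is why the result agrees with 0.0816 rather than the 0.0899 obtained from 9.32 and 9.68.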
METHOD 1
(M1)(A1)
P(10.2 < X < 11.4) = 0.7793… (A1)
confidence level is 77.9% A1
Note: Accept 78%.
METHOD 2
(M1)
(A1)
P(−1.224… < Z < 1.224…) = 0.7793… (A1)
confidence level is 77.9% A1
Note: Accept 78%.
[4 marks]
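METHOD 2 can be reproduced with a minimal Python sketch, using the values from the question (standard deviation 1.2, six smartphones, interval [10.2, 11.4]). Variable names are illustrative assumptions.

```python
from statistics import NormalDist

sigma = 1.2                  # known population standard deviation (hours)
n = 6                        # number of smartphones measured
lower, upper = 10.2, 11.4    # reported confidence interval for the mean

half_width = (upper - lower) / 2
standard_error = sigma / n ** 0.5

# The half-width equals z * standard_error, so solve for z.
z = half_width / standard_error

# Probability that a standard normal variable lies between -z and z.
confidence_level = 2 * NormalDist().cdf(z) - 1

print(round(z, 3), round(confidence_level, 4))
```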
Examiners report
This question will connect Markov chains and directed graphs.
Abi is playing a game that involves a fair coin with heads on one side and tails on the other, together with two tokens, one with a fish’s head on it and one with a fish’s tail on it. She starts off with no tokens and wishes to win them both. On each turn she tosses the coin. If she gets a head she can claim the fish’s head token, provided that she does not have it already; if she gets a tail she can claim the fish’s tail token, provided that she does not have it already. There are 4 states to describe the tokens in her possession: A: no tokens, B: only a fish’s head token, C: only a fish’s tail token, D: both tokens. For example, if she is in state B and tosses a tail she moves to state D, whereas if she tosses a head she remains in state B.
After throws the probability vector, for the 4 states, is given by where the numbers represent the probability of being in that particular state, e.g. is the probability of being in state B after throws. Initially .
Draw a transition state diagram for this Markov chain problem.
Explain why for any transition state diagram the sum of the out degrees of the directed edges from a vertex (state) must add up to +1.
Write down the transition matrix M, for this Markov chain problem.
Find the steady state probability vector for this Markov chain problem.
Explain which part of the transition state diagram confirms this.
Explain why having a steady state probability vector means that the matrix M must have an eigenvalue of 1.
Find .
Hence, deduce the form of .
Explain how your answer to part (f) fits with your answer to part (c).
Find the minimum number of tosses of the coin that Abi will have to make to be at least 95% certain of having finished the game by reaching state D.
Markscheme
M1A2
[3 marks]
You must leave the state along one of the edges directed out of the vertex. R1
[1 mark]
M1A2
[3 marks]
M1
since so steady state vector is . A1R1A1
[4 marks]
There is a loop with probability of 1 from state D to itself. A1
[1 mark]
Let the steady state probability vector be s, then Ms = 1s, showing that λ = 1 is an eigenvalue with associated eigenvector s. A1R1
[2 marks]
A1A1A1A1
[4 marks]
A2
[2 marks]
the steady state probability vector M1R1
[2 marks]
Require (e.g. by use of table) R1M1A2
[4 marks]
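The "use of table" approach can be sketched in Python by repeatedly applying the transition matrix until the probability of holding both tokens reaches 0.95. The matrix below is built from the rules stated in the question, written column-stochastic (each column lists where a state goes) to match the Ms = s convention in the markscheme; the function names and layout are assumptions for illustration.

```python
# Column-stochastic transition matrix, state order A, B, C, D.
# Entry M[i][j] is the probability of moving from state j to state i.
M = [
    [0.0, 0.0, 0.0, 0.0],  # to A: no tokens (never re-entered)
    [0.5, 0.5, 0.0, 0.0],  # to B: head token only
    [0.5, 0.0, 0.5, 0.0],  # to C: tail token only
    [0.0, 0.5, 0.5, 1.0],  # to D: both tokens (absorbing)
]


def step(matrix, vector):
    """One toss: multiply the probability vector by the transition matrix."""
    size = len(vector)
    return [sum(row[j] * vector[j] for j in range(size)) for row in matrix]


def tosses_needed(matrix, start, target_index, threshold):
    """Smallest number of tosses with P(target state) >= threshold."""
    vector = list(start)
    tosses = 0
    while vector[target_index] < threshold:
        vector = step(matrix, vector)
        tosses += 1
    return tosses


start = [1.0, 0.0, 0.0, 0.0]  # Abi begins with no tokens
n_min = tosses_needed(M, start, 3, 0.95)
print(n_min)
```

Under these rules the probability of not yet being in state D after n tosses is (1/2)^(n−1), since that requires every toss to show the same face, so the loop stops at 6 tosses.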
Examiners report
This question explores models for the height of water in a cylindrical container as water drains out.
The diagram shows a cylindrical water container of height metres and base radius metre. At the base of the container is a small circular valve, which enables water to drain out.
Eva closes the valve and fills the container with water.
At time , Eva opens the valve. She records the height, metres, of water remaining in the container every minutes.
Eva first tries to model the height using a linear function, , where .
Eva uses the equation of the regression line of on , to predict the time it will take for all the water to drain out of the container.
Eva thinks she can improve her model by using a quadratic function, , where .
Eva uses this equation to predict the time it will take for all the water to drain out of the container and obtains an answer of minutes.
Let be the volume, in cubic metres, of water in the container at time minutes.
Let be the radius, in metres, of the circular valve.
Eva does some research and discovers a formula for the rate of change of .
Eva measures the radius of the valve to be metres. Let be the time, in minutes, it takes for all the water to drain out of the container.
Eva wants to use the container as a timer. She adjusts the initial height of water in the container so that all the water will drain out of the container in minutes.
Eva has another water container that is identical to the first one. She places one water container above the other one, so that all the water from the highest container will drain into the lowest container. Eva completely fills the highest container, but only fills the lowest container to a height of metre, as shown in the diagram.
At time Eva opens both valves. Let be the height of water, in metres, in the lowest container at time .
Find the equation of the regression line of on .
Interpret the meaning of parameter in the context of the model.
Suggest why Eva’s use of the linear regression equation in this way could be unreliable.
Find the equation of the least squares quadratic regression curve.
Find the value of .
Hence, write down a suitable domain for Eva’s function .
Show that .
By solving the differential equation , show that the general solution is given by , where .
Use the general solution from part (d) and the initial condition to predict the value of .
Find this new height.
Show that , where .
Use Euler’s method with a step length of minutes to estimate the maximum value of .
Markscheme
A1A1
Note: Award A1 for an equation in and and A1 for the coefficient and constant .
[2 marks]
EITHER
the rate of change of height (of water in metres per minute) A1
Note: Accept “rate of decrease” or “rate of increase” in place of “rate of change”.
OR
the (average) amount that the height (of the water) decreases each minute A1
[1 mark]
EITHER
unreliable to use on equation to estimate A1
OR
unreliable to extrapolate from original data A1
OR
rate of change (of height) might not remain constant (as the water drains out) A1
[1 mark]
A1
[1 mark]
(M1)
A1
[2 marks]
EITHER
A1
OR
(due to range of original data / interpolation) A1
[1 mark]
(A1)
EITHER
M1
OR
attempt to use chain rule M1
THEN
A1
AG
[3 marks]
attempt to separate variables M1
A1
A1A1
Note: Award A1 for each correct side of the equation.
A1
Note: Award the final A1 for any correct intermediate step that clearly leads to the given equation.
AG
[5 marks]
(M1)
(A1)
substituting and their non-zero value of (M1)
(minutes) A1
[4 marks]
(A1)
(M1)
(metres) A1
[3 marks]
let be the height of water in the highest container from parts (d) and (e) we get
(M1)(A1)
so M1A1
AG
[4 marks]
evidence of using Euler’s method correctly
e.g. (A1)
maximum value of (metres) (at minutes) A2
( metres)
[3 marks]
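Euler's method for a height that first rises and then falls can be sketched in Python. The differential equation, step length and initial height from the question are missing from this copy, so `hypothetical_rate`, its constants, the step of 0.5 and the starting height of 1.0 are all assumptions chosen only to show the mechanics; they are not Eva's values.

```python
def euler(rate, t0, y0, step, n_steps):
    """Euler's method for dy/dt = rate(t, y); returns lists of t and y values."""
    ts, ys = [t0], [y0]
    for _ in range(n_steps):
        # y_{n+1} = y_n + step * f(t_n, y_n)
        y_next = ys[-1] + step * rate(ts[-1], ys[-1])
        ts.append(ts[-1] + step)
        ys.append(y_next)
    return ts, ys


def hypothetical_rate(t, height):
    """Hypothetical net rate: inflow that dies away minus a sqrt-law outflow."""
    inflow = max(0.0, 0.4 - 0.02 * t)
    outflow = 0.2 * max(height, 0.0) ** 0.5
    return inflow - outflow


ts, hs = euler(hypothetical_rate, 0.0, 1.0, 0.5, 40)

# The estimate of the maximum is the largest tabulated value.
h_max = max(hs)
t_at_max = ts[hs.index(h_max)]
print(h_max, t_at_max)
```

Reading the maximum from the table of iterates mirrors what a GDC recursion table would show; a smaller step length would refine the estimate.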
Examiners report
All parts were answered well. In part (a)(i) a few candidates lost a mark from either not writing an equation or not using the variables and . In part (a)(ii) some candidates incorrectly stated it was the rate of change of water, instead of the rate of change of the height of the water. A few weaker candidates simply stated it is the gradient of the line. In part (a)(iii) some candidates incorrectly criticized the linear model, instead of addressing the question about why it could be unreliable to use the model to make a prediction about the future.
This question was answered well by many candidates. In part (b)(ii) a small number of candidates incorrectly gave two answers for , showing a lack of understanding of the context of the model.
Many candidates recognized the need to use related rates of change, but could not present coherent working to reach the given answer. Often candidates either did not appreciate the need to use the equation for the volume of a cylinder or did not simplify their equation using . Many candidates wrote nonsense arguments trying to cancel the factor of . In these long paper 3 questions, the purpose of “show that” parts is often to enable candidates to re-enter a question if they are unable to do a previous part.
Many candidates were able to correctly separate the variables, but many found the integral of to be too difficult. A common incorrect approach was to use logarithms. A surprising number also incorrectly wrote , showing a lack of understanding of the difference between a parameter and a variable. Given that most questions in this course will be set in context, it is important that candidates learn to distinguish these differences.
Generally done well.
Many candidates found this question too difficult.
Part (g)(i) was often left blank and was the worst answered question on the paper. Part (g)(ii) was answered correctly by a number of candidates, who made use of the given answer from part (g)(i).
This question uses statistical tests to investigate whether advertising leads to increased profits for a grocery store.
Aimmika is the manager of a grocery store in Nong Khai. She is carrying out a statistical analysis on the number of bags of rice that are sold in the store each day. She collects the following sample data by recording how many bags of rice the store sells each day over a period of days.
She believes that her data follows a Poisson distribution.
Aimmika knows from her historic sales records that the store sells an average of bags of rice each day. The following table shows the expected frequency of bags of rice sold each day during the day period, assuming a Poisson distribution with mean .
Aimmika decides to carry out a goodness of fit test at the significance level to see whether the data follows a Poisson distribution with mean .
Aimmika claims that advertising in a local newspaper for Thai Baht per day will increase the number of bags of rice sold. However, Nichakarn, the owner of the store, claims that the advertising will not increase the store’s overall profit.
Nichakarn agrees to advertise in the newspaper for the next days. During that time, Aimmika records that the store sells bags of rice with a profit of on each bag sold.
Aimmika wants to carry out an appropriate hypothesis test to determine whether the number of bags of rice sold during the days increased when compared with the historic sales records.
Find the mean and variance for the sample data given in the table.
Hence state why Aimmika believes her data follows a Poisson distribution.
State one assumption that Aimmika needs to make about the sales of bags of rice to support her belief that it follows a Poisson distribution.
Find the value of , of , and of . Give your answers to decimal places.
Write down the number of degrees of freedom for her test.
Perform the goodness of fit test and state, with reason, a conclusion.
By finding a critical value, perform this test at a significance level.
Hence state the probability of a Type I error for this test.
By considering the claims of both Aimmika and Nichakarn, explain whether the advertising was beneficial to the store.
Markscheme
mean A1
variance A1
[2 marks]
mean is close to the variance A1
[1 mark]
One of the following:
the number of bags sold each day is independent of any other day
the sale of one bag is independent of any other bag sold
the sales of bags of rice (each day) occur at a constant mean rate A1
Note: Award A1 for a correct answer in context. Any statement referring to independence must refer to either the independence of each bag sold or the independence of the number of bags sold each day. If the third option is seen, the statement must refer to a “constant mean” or “constant average”. Do not accept “the number of bags sold each day is constant”.
[1 mark]
attempt to find Poisson probabilities and multiply by (M1)
A1
A1
EITHER
(M1)
A1
OR
(M1)
A1
Note: Do not penalize the omission of clear , and labelling as this will be penalized later if correct values are interchanged.
[5 marks]
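The method of multiplying Poisson probabilities by the number of days can be sketched in Python. The mean of 4.2 and the 90-day period are taken from the examiners' report for this question; the category boundaries of the original table are missing from this copy, so the choice of exact cells for 0 to 7 bags plus a final "more than 7" cell is an assumption.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X following a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_frequencies(mean, days, largest_exact):
    """Expected counts for 0, 1, ..., largest_exact and one final tail cell."""
    exact = [days * poisson_pmf(k, mean) for k in range(largest_exact + 1)]
    # The tail cell takes whatever is left so the frequencies total `days`.
    tail = days - sum(exact)
    return exact + [tail]


frequencies = expected_frequencies(mean=4.2, days=90, largest_exact=7)
for bags, value in enumerate(frequencies):
    print(bags, round(value, 3))
```

Computing the last cell as the remainder, rather than as a single Poisson probability, keeps the expected frequencies summing to the number of days.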
A1
[1 mark]
The number of bags of rice sold each day follows a Poisson distribution with mean . A1
The number of bags of rice sold each day does not follow a Poisson distribution with mean . A1
Note: Award A1A1 for both hypotheses correctly stated and in correct order. Award A1A0 if reference to the data and/or “mean ” is not included in the hypotheses, but otherwise correct.
evidence of attempting to group data to obtain the observed frequencies for and (M1)
-value A2
R1
the result is not significant so there is no reason to reject (the number of bags sold each day follows a Poisson distribution) A1
Note: Do not award R0A1. The conclusion MUST follow through from their hypotheses. If no hypotheses are stated, the final A1 can still be awarded for a correct conclusion as long as it is in context (e.g. therefore the data follows a Poisson distribution).
[7 marks]
METHOD 1
evidence of multiplying (seen anywhere) M1
A1
Note: Accept and for the A1.
evidence of finding probabilities around critical region (M1)
Note: Award (M1) for any of these values seen:
OR
OR
OR
critical value A1
, R1
the null hypothesis is rejected A1
(the advertising increased the number of bags sold during the days)
Note: Do not award R0A1. Accept statements referring to the advertising being effective for A1 as long as the R mark is satisfied. For the R1A1, follow through within the part from their critical value.
METHOD 2
evidence of dividing by (or seen anywhere) M1
A1
attempt to find critical value using central limit theorem (M1)
(e.g. sample standard deviation , etc.)
Note: Award (M1) for a -value of seen.
critical value A1
R1
the null hypothesis is rejected A1
(the advertising increased the number of bags sold during the days)
Note: Do not award R0A1. Accept statements referring to the advertising being effective for A1 as long as the R mark is satisfied. For the R1A1, follow through within the part from their critical value.
[6 marks]
A1
Note: If a candidate uses METHOD 2 in part (e)(i), allow an FT answer of for this part but only if the candidate has attempted to find a -value.
[1 mark]
attempt to compare profit difference with cost of advertising (M1)
Note: Award (M1) for evidence of candidate mathematically comparing a profit difference with the cost of the advertising.
EITHER
(comparing profit from extra bags of rice with cost of advertising)
A1
OR
(comparing total profit with and without advertising)
A1
OR
(comparing increase of average daily profit with daily advertising cost)
A1
THEN
EITHER
Even though the number of bags of rice increased, the advertising is not worth it as the overall profit did not increase. R1
OR
The advertising is worth it even though the increased profit is less than the cost, since the number of customers increased (possibly buying other products and/or returning in the future after advertising stops). R1
Note: Follow through within the part for correct reasoning consistent with their comparison.
[3 marks]
Examiners report
Candidates generally did well in finding the mean, although some wasted time by calculating it by hand rather than by using their GDC. Many candidates were able to find a correct variance. However, there were also many who gave the standard deviation as their variance or simply made the variance the same as their mean without performing a calculation, possibly looking ahead to part (a)(ii). Many candidates successfully used the clue given by the command term “hence” and correctly answered (a)(ii).
It was clear candidates understood that independence was the key term needed in the response. However, a number of candidates struggled either by not being precise enough in their responses (e.g. simply stating “they are independent”) or by incorrectly stating that the bags of rice sold must be independent of the number of days. “Communication” is an assessment objective for the course, and candidates should aim for clarity in their responses thereby ensuring the examiner can be confident in awarding credit.
This question was generally done well by the candidates. The two most common mistakes both stemmed from candidates not paying attention to the instructions given. Candidates either did not give their answer correct to 3 decimal places or they incorrectly attempted to use the normal distribution to find the expected frequencies. Another frequent mistake involved candidates multiplying their probabilities by 100 rather than by 90.
While most candidates were able to gain the mark in part (i), many candidates arrived at a correct answer but by using the incorrect χ² test for independence method of determining the degrees of freedom. In part (ii), several common mistakes led to very few candidates receiving the full seven marks for this question part. Candidates struggled to correctly write the hypotheses as they either wrote them in reverse order or they did not correctly reference the data and/or a Poisson distribution with mean 4.2. Unfortunately, some did not state the hypotheses at all. Another common mistake involved candidates incorrectly combining columns to create a column for days when at least 7 bags of rice were sold, possibly from incorrectly thinking that an observed value less than 5 is not allowed when carrying out a goodness of fit test. Many candidates also missed out on possible follow through marks for their p-value by not fully writing out the observed and expected values they were inputting into their GDC. Although communication was not being assessed here, it highlights how it is easier to credit a correct method (even if leading to an incorrect answer) if there is appropriate working and/or a running commentary present.
In contrast to part 1(d) where a method was given, many candidates struggled to know how to find a critical value in this question part. Although it was possible for candidates to find a critical value by either using Poisson probabilities or probabilities from a normal approximation, few knew how to begin. As a result, very few candidates scored full marks here.
Very few candidates managed to get full marks in this question part. Again, without being guided into a particular method, candidates struggled to understand how to begin the problem. Many candidates did not realize that some calculations were necessary to give a proper conclusion. A subset of these candidates thought it was a continuation of part (e) and made some statement related to their conclusion from the hypothesis test. For the candidates that did try to make some calculations, many simply calculated the profit from selling 282 bags of rice and compared that to the cost of the advertising. These candidates did not realize they needed to compare the difference in the expected profit without advertising and the actual profit with advertising.
This question explores methods to analyse the scores in an exam.
A random sample of 149 scores for a university exam are given in the table.
The university wants to know if the scores follow a normal distribution, with the mean and variance found in part (a).
The expected frequencies are given in the table.
The university assigns a pass grade to students whose scores are in the top 80%.
The university also wants to know if the exam is gender neutral. They obtain random samples of scores for male and female students. The mean, sample variance and sample size are shown in the table.
The university awards a distinction to students who achieve high scores in the exam. Typically, 15% of students achieve a distinction. A new exam is trialed with a random selection of students on the course. 5 out of 20 students achieve a distinction.
A different exam is trialed with 16 students. Let be the percentage of students achieving a distinction. It is desired to test the hypotheses
against
It is decided to reject the null hypothesis if the number of students achieving a distinction is greater than 3.
Find an unbiased estimate for the population mean.
Find an unbiased estimate for the population variance.
Show that the expected frequency for 20 < ≤ 4 is 31.5 correct to 1 decimal place.
Perform a suitable test, at the 5% significance level, to determine if the scores follow a normal distribution, with the mean and variance found in part (a). You should clearly state your hypotheses, the degrees of freedom, the p-value and your conclusion.
Use the normal distribution model to find the score required to pass.
Perform a suitable test, at the 5% significance level, to determine if there is a difference between the mean scores of males and females. You should clearly state your hypotheses, the p-value and your conclusion.
Perform a suitable test, at the 5% significance level, to determine if it is easier to achieve a distinction on the new exam. You should clearly state your hypotheses, the critical region and your conclusion.
Find the probability of making a Type I error.
Given that find the probability of making a Type II error.
Markscheme
52.8 A1
[1 mark]
M1A1
[2 marks]
M1A1
M1
= 31.5 AG
[3 marks]
use of a goodness of fit test M1
and A1A1
A1
p-value = 0.569 A2
Since 0.569 > 0.05 R1
Insufficient evidence to reject . The scores follow a normal distribution. A1
[8 marks]
M1A1
[2 marks]
use of a t-test M1
and A1
p-value = 0.180 A2
Since 0.180 > 0.05 R1
Insufficient evidence to reject . There is no difference between males and females. A1
[6 marks]
use of test for proportion using Binomial distribution M1
and A1
and M1
So the critical region is A1
Since 5 < 7 R1
Insufficient evidence to reject . It is not easier to achieve a distinction on the new exam. A1
[6 marks]
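The critical region for the binomial test can be found with a short Python sketch, using the values in the question (20 students, a typical distinction rate of 15%, 5% significance level, one-tailed upper test). The function names are illustrative assumptions; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X following a binomial distribution B(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_value(n, p, alpha):
    """Smallest k such that P(X >= k) <= alpha under B(n, p)."""
    for k in range(n + 2):
        # For k = n + 1 the tail is empty (probability 0), so the loop always returns.
        if binomial_upper_tail(k, n, p) <= alpha:
            return k


k_crit = critical_value(n=20, p=0.15, alpha=0.05)
print(k_crit, binomial_upper_tail(k_crit, 20, 0.15))
```

The search returns 7, in line with the comparison "5 < 7" in the markscheme. The same tail function gives the probability of rejecting for any assumed proportion, which is the calculation needed for Type I and Type II errors once the null and alternative values are known.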
using M1
M1A1
[3 marks]
using M1
M1A1
[3 marks]
Examiners report
Juliet is a sociologist who wants to investigate if income affects happiness amongst doctors. This question asks you to review Juliet’s methods and conclusions.
Juliet obtained a list of email addresses of doctors who work in her city. She contacted them and asked them to fill in an anonymous questionnaire. Participants were asked to state their annual income and to respond to a set of questions. The responses were used to determine a happiness score out of . Of the doctors on the list, replied.
Juliet’s results are summarized in the following table.
For the remaining ten responses in the table, Juliet calculates the mean happiness score to be .
Juliet decides to carry out a hypothesis test on the correlation coefficient to investigate whether increased annual income is associated with greater happiness.
Juliet wants to create a model to predict how changing annual income might affect happiness scores. To do this, she assumes that annual income in dollars, , is the independent variable and the happiness score, , is the dependent variable.
She first considers a linear model of the form
.
Juliet then considers a quadratic model of the form
.
After presenting the results of her investigation, a colleague questions whether Juliet’s sample is representative of all doctors in the city.
A report states that the mean annual income of doctors in the city is . Juliet decides to carry out a test to determine whether her sample could realistically be taken from a population with a mean of .
Describe one way in which Juliet could improve the reliability of her investigation.
Describe one criticism that can be made about the validity of Juliet’s investigation.
Juliet classifies response as an outlier and removes it from the data. Suggest one possible justification for her decision to remove it.
Calculate the mean annual income for these remaining responses.
Determine the value of , Pearson’s product-moment correlation coefficient, for these remaining responses.
State why the hypothesis test should be one-tailed.
State the null and alternative hypotheses for this test.
The critical value for this test, at the significance level, is . Juliet assumes that the population is bivariate normal.
Determine whether there is significant evidence of a positive correlation between annual income and happiness. Justify your answer.
Use Juliet’s data to find the value of and of .
Interpret, referring to income and happiness, what the value of represents.
Find the value of , of and of .
Find the coefficient of determination for each of the two models she considers.
Hence compare the two models.
Juliet decides to use the coefficient of determination to choose between these two models.
Comment on the validity of her decision.
State the name of the test which Juliet should use.
State the null and alternative hypotheses for this test.
Perform the test, using a significance level, and state your conclusion in context.
Markscheme
Any one from: R1
increase sample size / increase response rate / repeat process
check whether sample is representative
test-retest participants or do a parallel test
use a stratified sample
use a random sample
Note: Do not condone:
Ask different types of doctor
Ask for proof of income
Ask for proof of being a doctor
Remove anonymity
Remove response .
[1 mark]
Any one from: R1
non-random sampling means a subset of population might be responding
self-reported happiness is not the same as happiness
happiness is not a constant / cannot be quantified / is difficult to measure
income might include external sources
Juliet is only sampling doctors in her city
correlation does not imply causation
sample might be biased
Note: Do not condone the following common but vague responses unless they make a clear link to validity:
Sample size is too small
Result is not generalizable
There may be other variables Juliet is ignoring
Sample might not be representative
[1 mark]
because the income is very different / implausible / clearly contrived R1
Note: Answers must explicitly reference "income" to get credit.
[1 mark]
(M1)A1
[2 marks]
A2
[2 marks]
EITHER
only looking for change in one direction R1
OR
only looking for greater happiness with greater income R1
OR
only looking for evidence of positive correlation R1
[1 mark]
A1A1
Note: Award A1 for seen (do not accept ), A1 for both correct hypotheses, using their or . Accept an equivalent statement in words, however reference to “correlation for the population” or “association for the population” must be explicit for the first A1 to be awarded.
Watch out for a null hypothesis in words similar to “Annual income is not associated with greater happiness”. This is effectively saying and should not be condoned.
[2 marks]
METHOD 1 – using critical value of
R1
(therefore significant evidence of) a positive correlation A1
Note: Do not award R0A1.
METHOD 2 – using -value
A1
Note: Follow through from their -value from part (c)(ii).
(therefore significant evidence of) a positive correlation A1
Note: Do not award A0A1.
[2 marks]
A1
[1 mark]
EITHER
the amount the happiness score increases for every increase in (annual) income A1
OR
rate of change of happiness with respect to (annual) income A1
Note: Accept equivalent responses e.g. an increase of in happiness for every increase in salary.
[1 mark]
,
,
A1
[1 mark]
for quadratic model: A1
for linear model: A1
Note: Follow through from their value from part (c)(ii).
[2 marks]
EITHER
quadratic model is a better fit to the data / more accurate A1
OR
quadratic model explains a higher proportion of the variance A1
[1 mark]
EITHER
not valid, not a useful measure to compare models with different numbers of parameters A1
OR
not valid, quadratic model will always have a better fit than a linear model A1
Note: Accept any other sensible critique of the validity of the method. Do not accept any answers which focus on the conclusion rather than the method of model selection.
[1 mark]
(single sample) -test A1
[1 mark]
EITHER
A1
OR
(sample is drawn from a population where) the population mean is
the population mean is not A1
Note: Do not allow FT from an incorrect test in part (f)(i) other than a -test.
[1 mark]
A1
Note: For a -test follow through from part (f)(i), either (from biased estimate of variance) or (from unbiased estimate of variance).
R1
EITHER
no (significant) evidence that mean differs from A1
OR
the sample could plausibly have been drawn from the quoted population A1
Note: Allow R1FTA1FT from an incorrect -value, but the final A1 must still be in the context of the original research question.
[3 marks]
Examiners report
A random variable has a distribution with mean and variance 4. A random sample of size 100 is to be taken from the distribution of .
Josie takes a different random sample of size 100 to test the null hypothesis that against the alternative hypothesis that at the 5 % level.
State the central limit theorem as applied to a random sample of size , taken from a distribution with mean and variance .
Jack takes a random sample of size 100 and calculates that . Find an approximate 90 % confidence interval for .
Find the critical region for Josie’s test, giving your answer correct to two decimal places.
Write down the probability that Josie makes a Type I error.
Given that the probability that Josie makes a Type II error is 0.25, find the value of , giving your answer correct to three significant figures.
Markscheme
for (sufficiently) large the sample mean approximately A1
A1
Note: Award the first A1 for large and reference to the sample mean , the second A1 is for normal and the two parameters.
Note: Award the second A1 only if the first A1 is awarded.
Note: Allow ‘ tends to infinity’ or ‘ ≥ 30’ in place of ‘large’.
[2 marks]
[59.9, 60.5] A1A1
Note: Accept answers which round to the correct 3sf answers.
[2 marks]
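As a rough check of the interval above, the following sketch computes a normal-based confidence interval for a known variance of 4 and a sample of size 100. The sample mean of 60.2 is an assumption, inferred from the midpoint of the published interval [59.9, 60.5]; the helper name `normal_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(xbar, variance, n, level):
    """Two-sided confidence interval for a mean when the variance is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    half_width = z * sqrt(variance / n)        # z times the standard error
    return xbar - half_width, xbar + half_width

# xbar = 60.2 is an assumed value (midpoint of the published interval)
lower, upper = normal_ci(xbar=60.2, variance=4, n=100, level=0.90)
print(round(lower, 1), round(upper, 1))
```

Under that assumption the rounded limits agree with the interval quoted in the markscheme.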
under , (A1)
required to find such that (M1)
use of any valid method, eg GDC Inv(Normal) or (M1)
hence critical region is A1
[4 marks]
0.05 A1
[1 mark]
P(Type II error) = P(H0 is accepted / H0 is false) (R1)
Note: Accept Type II error means H0 is accepted given H0 is false.
when (M1)
(M1)
where
(A1)
A1
[5 marks]
Examiners report
A firm wishes to review its recruitment processes. This question considers the validity and reliability of the methods used.
Every year an accountancy firm recruits new employees for a trial period of one year from a large group of applicants.
At the start, all applicants are interviewed and given a rating. Those with a rating of either Excellent, Very good or Good are recruited for the trial period. At the end of this period, some of the new employees will stay with the firm.
It is decided to test how valid the interview rating is as a way of predicting which of the new employees will stay with the firm.
Data is collected and recorded in a contingency table.
The next year’s group of applicants are asked to complete a written assessment which is then analysed. From those recruited as new employees, a random sample of size is selected.
The sample is stratified by department. Of the new employees recruited that year, were placed in the national department and in the international department.
At the end of their first year, the level of performance of each of the employees in the sample is assessed by their department manager. They are awarded a score between (low performance) and (high performance).
The marks in the written assessment and the scores given by the managers are shown in both the table and the scatter diagram.
The firm decides to find a Spearman’s rank correlation coefficient, , for this data.
The same seven employees are given the written assessment a second time, at the end of the first year, to measure its reliability. Their marks are shown in the table below.
The written assessment is in five sections, numbered to . At the end of the year, the employees are also given a score for each of five professional attributes: and .
The firm decides to test the hypothesis that there is a correlation between the mark in a section and the score for an attribute.
They compare marks in each of the sections with scores for each of the attributes.
Use an appropriate test, at the significance level, to determine whether a new employee staying with the firm is independent of their interview rating. State the null and alternative hypotheses, the p-value and the conclusion of the test.
Show that employees are selected for the sample from the national department.
Without calculation, explain why it might not be appropriate to calculate a correlation coefficient for the whole sample of employees.
Find for the seven employees working in the international department.
Hence comment on the validity of the written assessment as a measure of the level of performance of employees in this department. Justify your answer.
State the name of this type of test for reliability.
For the data in this table, test the null hypothesis, , against the alternative hypothesis, , at the significance level. You may assume that all the requirements for carrying out the test have been met.
Hence comment on the reliability of the written assessment.
Write down the number of tests they carry out.
The tests are performed at the significance level.
Assuming that:
- there is no correlation between the marks in any of the sections and scores in any of the attributes,
- the outcome of each hypothesis test is independent of the outcome of the other hypothesis tests,
find the probability that at least one of the tests will be significant.
The firm obtains a significant result when comparing section of the written assessment and attribute . Interpret this result.
Markscheme
Use of χ² test for independence (M1)
Staying (or leaving) the firm and interview rating are independent.
Staying (or leaving) the firm and interview rating are not independent A1
Note: For accept ‘…are dependent’ in place of ‘…not independent’.
-value A2
Note: Award A1 for if -value is omitted or incorrect.
R1
(the result is not significant at the level)
insufficient evidence to reject the (or “accept ”) A1
Note: Do not award R0A1. The final R1A1 can follow through from their incorrect -value
[6 marks]
M1A1
Note: Award A1 for anything that rounds to .
AG
[2 marks]
there seems to be a difference between the two departments (A1)
the international department manager seems to be less generous than the national department manager R1
Note: The A1 is for commenting there is a difference between the two departments and the R1 is for correctly commenting on the direction of the difference
[2 marks]
(M1)(A1)
Note: Award (M1) for an attempt to rank the data, and (A1) for correct ranks for both variables. Accept either set of rankings in reverse.
(M1)(A1)
Note: The (M1) is for calculating the PMCC for their ranks.
Note: If a final answer of is seen, from use of , award (M1)(A1)A1.
Accept if one set of ranks has been ordered in reverse.
[4 marks]
EITHER
there is a (strong) association between the written assessment mark and the manager scores. A1
OR
there is a (strong) agreement in the rank order of the written assessment marks and the rank order of the manager scores. A1
OR
there is a (strong linear) correlation between the rank order of the written assessment marks and the rank order of the manager scores. A1
Note: Follow through on a value for their value of in c(ii).
THEN
the written assessment is likely to be a valid measure (of the level of employee performance) R1
[2 marks]
test-retest A1
[1 mark]
-value A2
R1
(the result is significant at the level)
(there is sufficient evidence to) reject A1
Note: Do not award R0A1. Accept “accept ”. The final R1A1 can follow through from their incorrect -value.
[4 marks]
the test seems reliable A1
Note: Follow through from their answer in part (d)(ii). Do not award if there is no conclusion in d(ii).
[1 mark]
A1
[1 mark]
probability of significant result given no correlation is (M1)
probability of at least one significant result in tests is
(M1)(A1)
Note: Award (M1) for use of or the binomial distribution with any value of .
A1
[4 marks]
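The calculation in this part can be sketched in Python as follows. Five sections compared with five attributes gives 25 tests; the significance level is not reproduced in the text above, so `alpha = 0.05` below is an assumed value used only for illustration, and the function name is hypothetical.

```python
def prob_at_least_one_significant(alpha, n_tests):
    """P(at least one significant result) for independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

n_tests = 5 * 5   # five sections, each compared with five attributes
alpha = 0.05      # ASSUMED significance level, for illustration only
p_any = prob_at_least_one_significant(alpha, n_tests)
print(n_tests, round(p_any, 3))
```

Even a modest per-test level gives a high chance of at least one spurious significant result, which is the reasoning behind the interpretation asked for in the final part.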
(though the result is significant) it is very likely that one significant result would be achieved by chance, so it should be disregarded or further evidence sought R1
[1 mark]
Examiners report
The random variables follow a bivariate normal distribution with product moment correlation coefficient .
A random sample of 12 observations on U, V is obtained to determine whether there is a correlation between U and V. The sample product moment correlation coefficient is denoted by r. A test to determine whether or not U, V are independent is carried out at the 1% level of significance.
State suitable hypotheses to investigate whether or not , are independent.
Find the least value of for which the test concludes that .
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
A1A1
[2 marks]
(A1)
(M1)(A1)
we reject if (R1)
attempting to solve for M1
Note: Allow = instead of >.
(least value of is) 0.708 (3 sf) A1
Note: Award A1M1A0R1M1A0 to candidates who use a one-tailed test. Award A0M1A0R1M1A0 to candidates who use an incorrect number of degrees of freedom or both a one-tailed test and incorrect degrees of freedom.
Note: Possible errors are
10 DF 1-tail, , least value 0.658
11 DF 2-tail, , least value 0.684
11 DF 1-tail, , least value 0.634.
[6 marks]
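The least values listed above can be reproduced by inverting the usual test statistic t = r√(df / (1 − r²)), which gives r = t / √(t² + df). The sketch below is a minimal check; the critical t values are copied from standard t tables rather than computed, and the function name `critical_r` is hypothetical.

```python
from math import sqrt

def critical_r(t_crit, df):
    """Invert t = r*sqrt(df/(1 - r^2)) to get the smallest significant r."""
    return t_crit / sqrt(t_crit ** 2 + df)

# (critical t from standard tables, degrees of freedom), 1% significance level
cases = {
    "10 DF 2-tail": (3.169, 10),  # correct approach: n - 2 = 10, two-tailed
    "10 DF 1-tail": (2.764, 10),
    "11 DF 2-tail": (3.106, 11),
    "11 DF 1-tail": (2.718, 11),
}
results = {label: round(critical_r(t, df), 3) for label, (t, df) in cases.items()}
for label, r in results.items():
    print(label, r)
```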
Examiners report
A farmer sells bags of potatoes which he states have a mean weight of 7 kg. An inspector, however, claims that the mean weight is less than 7 kg. In order to test this claim, the inspector takes a random sample of 12 of these bags and determines the weight, kg, of each bag. He finds that You may assume that the weights of the bags of potatoes can be modelled by the normal distribution .
State suitable hypotheses to test the inspector’s claim.
Find unbiased estimates of and .
Carry out an appropriate test and state the p-value obtained.
Using a 10% significance level and justifying your answer, state your conclusion in context.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
A1
[1 mark]
A1
(M1)A1
[3 marks]
(M1)(A1)
(A1)
A1
Note: Accept any answer that rounds correctly to 0.12.
[4 marks]
because R1
the inspector’s claim is not supported (at the 10% level)
(or equivalent in context) A1
Note: Only award the A1 if the R1 has been awarded
[2 marks]
Examiners report
Two IB schools, A and B, follow the IB Diploma Programme but have different teaching methods. A research group tested whether the different teaching methods lead to a similar final result.
For the test, a group of eight students were randomly selected from each school. Both samples were given a standardized test at the start of the course and a prediction for total IB points was made based on that test; this was then compared to their points total at the end of the course.
Previous results indicate that both the predictions from the standardized tests and the final IB points can be modelled by a normal distribution.
It can be assumed that:
- the standardized test is a valid method for predicting the final IB points
- variations from the prediction can be explained through the circumstances of the student or school.
The data for school A is shown in the following table.
For each student, the change from the predicted points to the final points was calculated.
The data for school B is shown in the following table.
School A also gives each student a score for effort in each subject. This effort score is based on a scale of 1 to 5 where 5 is regarded as outstanding effort.
It is claimed that the effort put in by a student is an important factor in improving upon their predicted IB points.
A mathematics teacher in school A claims that the comparison between the two schools is not valid because the sample for school B contained mainly girls and that for school A, mainly boys. She believes that girls are likely to show a greater improvement from their predicted points to their final points.
She collects more data from other schools, asking them to class their results into four categories as shown in the following table.
Identify a test that might have been used to verify the null hypothesis that the predictions from the standardized test can be modelled by a normal distribution.
State why comparing only the final IB points of the students from the two schools would not be a valid test for the effectiveness of the two different teaching methods.
Find the mean change.
Find the standard deviation of the changes.
Use a paired t-test to determine whether there is significant evidence that the students in school A have improved their IB points since the start of the course.
Use an appropriate test to determine whether there is evidence, at the 5 % significance level, that the students in school B have improved more than those in school A.
State why it was important to test that both sets of points were normally distributed.
Perform a test on the data from school A to show it is reasonable to assume a linear relationship between effort scores and improvements in IB points. You may assume effort scores follow a normal distribution.
Hence, find the expected improvement between predicted and final points for an increase of one unit in effort grades, giving your answer to one decimal place.
Use an appropriate test to determine whether showing an improvement is independent of gender.
If you were to repeat the test performed in part (e) intending to compare the quality of the teaching between the two schools, suggest two ways in which you might choose your sample to improve the validity of the test.
Markscheme
χ² (goodness of fit) A1
[1 mark]
EITHER
because aim is to measure improvement
OR
because the students may be of different ability in the two schools R1
[1 mark]
0.1875 (accept 0.188, 0.19) A1
[1 mark]
2.46 (M1)A1
Note: Award (M1)A0 for 2.63.
[2 marks]
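The two values quoted for the standard deviation differ by the factor √(n / (n − 1)) with n = 8. This suggests, as an assumption (the data table is not reproduced here), that 2.46 uses divisor n and 2.63 uses divisor n − 1. A minimal sketch of that relationship:

```python
from math import sqrt

n = 8                 # students sampled from school A
sd_divisor_n = 2.46   # value awarded full marks in the markscheme

# Rescale to the divisor (n - 1) version of the standard deviation
sd_divisor_n_minus_1 = sd_divisor_n * sqrt(n / (n - 1))
print(round(sd_divisor_n_minus_1, 2))
```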
: there has been no improvement
: there has been an improvement A1
attempt at a one-tailed paired t-test (M1)
p-value = 0.423 A1
there is no significant evidence that the students have improved R1
Note: If the hypotheses are not stated award a maximum of A0M1A1R0.
[4 marks]
: there is no difference between the schools
: school B did better than school A A1
one-tailed 2 sample t-test (M1)
p-value = 0.0984 A1
0.0984 > 0.05 (not significant at the 5 % level) so do not reject the null hypothesis R1A1
Note: The final A1 cannot be awarded following an incorrect reason. The final R1A1 can follow through from their incorrect p-value. Award a maximum of A1(M1)A0R1A1 for p-value = 0.0993.
[5 marks]
sample too small for the central limit theorem to apply (and t-tests assume normal distribution) R1
[1 mark]
:
: A1
Note: Allow hypotheses to be expressed in words.
p-value = 0.00157 A1
(0.00157 < 0.01) there is significant evidence of a (linear) correlation between effort and improvement (so it is reasonable to assume a linear relationship) R1
[3 marks]
(gradient of line of regression =) 6.6 A1
[1 mark]
: improvement and gender are independent
: improvement and gender are not independent A1
choice of χ² test for independence (M1)
groups first two columns as expected values in first column less than 5 M1
new observed table
(A1)
p-value = 0.581 A1
no significant evidence that gender and improvement are dependent R1
[6 marks]
For example:
larger samples / include data from whole school
take equal numbers of boys and girls in each sample
have a similar range of abilities in each sample
(if possible) have similar ranges of effort R1R1
Note: Award R1 for each reasonable suggestion to improve the validity of the test.
[2 marks]
Examiners report
The weights, X kg, of the males of a species of bird may be assumed to be normally distributed with mean 4.8 kg and standard deviation 0.2 kg.
The weights, Y kg, of female birds of the same species may be assumed to be normally distributed with mean 2.7 kg and standard deviation 0.15 kg.
Find the probability that a randomly chosen male bird weighs between 4.75 kg and 4.85 kg.
Find the probability that the weight of a randomly chosen male bird is more than twice the weight of a randomly chosen female bird.
Two randomly chosen male birds and three randomly chosen female birds are placed on a weighing machine that has a weight limit of 18 kg. Find the probability that the total weight of these five birds is greater than the weight limit.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
Note: In question 1, accept answers that round correctly to 2 significant figures.
P(4.75 < X < 4.85) = 0.197 A1
[1 mark]
consider the random variable X − 2Y (M1)
E(X − 2Y) = − 0.6 (A1)
Var(X − 2Y) = Var(X) + 4Var(Y) (M1)
= 0.13 (A1)
X − 2Y ∼ N(−0.6, 0.13)
P(X − 2Y > 0) (M1)
= 0.0480 A1
[6 marks]
let W = X1 + X2 + Y1 + Y2 + Y3 be the total weight
E(W) = 17.7 (A1)
Var(W) = 2Var(X) + 3Var(Y) = 0.1475 (M1)(A1)
W ∼ N(17.7, 0.1475)
P(W > 18) = 0.217 A1
[4 marks]
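The three probabilities in this markscheme can be checked with Python's standard library. The sketch below assumes, as the working above does, that all the weights are independent; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=4.8, sigma=0.2)    # male weights (kg)
Y = NormalDist(mu=2.7, sigma=0.15)   # female weights (kg)

# P(4.75 < X < 4.85)
p_between = X.cdf(4.85) - X.cdf(4.75)

# X - 2Y: the means combine linearly, the variances as Var(X) + 4 Var(Y)
D = NormalDist(mu=X.mean - 2 * Y.mean,
               sigma=sqrt(X.variance + 4 * Y.variance))
p_male_over_twice_female = 1 - D.cdf(0)

# W = X1 + X2 + Y1 + Y2 + Y3: variances add for independent variables
W = NormalDist(mu=2 * X.mean + 3 * Y.mean,
               sigma=sqrt(2 * X.variance + 3 * Y.variance))
p_over_limit = 1 - W.cdf(18)

print(round(p_between, 3), round(p_male_over_twice_female, 4), round(p_over_limit, 3))
```

Note that the total of two males uses 2Var(X), not 4Var(X), because it is a sum of two independent weights rather than one weight doubled.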
Examiners report
Mr Sailor owns a fish farm and he claims that the weights of the fish in one of his lakes have a mean of 550 grams and standard deviation of 8 grams.
Assume that the weights of the fish are normally distributed and that Mr Sailor’s claim is true.
Kathy is suspicious of Mr Sailor’s claim about the mean and standard deviation of the weights of the fish. She collects a random sample of fish from this lake whose weights are shown in the following table.
Using these data, test at the 5% significance level the null hypothesis against the alternative hypothesis , where grams is the population mean weight.
Kathy decides to use the same fish sample to test at the 5% significance level whether or not there is a positive association between the weights and the lengths of the fish in the lake. The following table shows the lengths of the fish in the sample. The lengths of the fish can be assumed to be normally distributed.
Find the probability that a fish from this lake will have a weight of more than 560 grams.
The maximum weight a hand net can hold is 6 kg. Find the probability that a catch of 11 fish can be carried in the hand net.
State the distribution of your test statistic, including the parameter.
Find the p-value for the test.
State the conclusion of the test, justifying your answer.
State suitable hypotheses for the test.
Find the product-moment correlation coefficient .
State the p-value and interpret it in this context.
Use an appropriate regression line to estimate the weight of a fish with length 360 mm.
Markscheme
Note: Accept all answers that round to the correct 2sf answer in (a), (b) and (c) but not in (d).
X ∼ N(550, 8²) (M1)
A1
[2 marks]
Xi ∼ N(550, 8²), i = 1, …, 11
let
A1
(M1)A1
A1
[4 marks]
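The numerical answers for these first two parts are not reproduced above, so the following is only a sketch of the method under Mr Sailor's claim (mean 550 g, standard deviation 8 g), assuming the 11 weights are independent. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

weight = NormalDist(mu=550, sigma=8)   # weight of one fish (grams)

# P(weight > 560)
p_heavy = 1 - weight.cdf(560)

# Total weight of 11 independent fish: mean 11*550, variance 11*8^2
n = 11
total = NormalDist(mu=n * weight.mean, sigma=sqrt(n * weight.variance))

# The net holds at most 6 kg = 6000 g
p_net_holds = total.cdf(6000)

print(p_heavy, p_net_holds)
```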
t-distribution with 7 degrees of freedom A1A1
[2 marks]
p = 0.25779…= 0.258 A2
[2 marks]
p > 0.05 R1
therefore we conclude that there is no evidence to reject A1
Note: FT their p-value.
Note: Only award A1 if R1 awarded.
[2 marks]
, A1
Note: Do not accept in place of .
[1 mark]
r = 0.782 A2
[2 marks]
0.01095… = 0.0110 A1
since 0.0110 < 0.05 R1
there is positive association between weight and length A1
Note: FT their p-value.
Note: Only award A1 if R1 awarded.
Note: Conclusion must be in context.
[3 marks]
regression line of (weight) on (length) is (M1)
weight = 0.8267… × length + 255.96… (A1)
length = 360 gives weight = 554 A1
Note: Award M1A0A0 for the wrong regression line, that is length = 0.7393… × weight – 51.62….
[3 marks]
Examiners report
The times , in minutes, taken by a random sample of 75 workers of a company to travel to work can be summarized as follows
, .
Let be the random variable that represents the time taken to travel to work by a worker of this company.
Find unbiased estimates of the mean of .
Find unbiased estimates of the variance of .
Assuming that is normally distributed, find
(i) the 90% confidence interval for the mean time taken to travel to work by the workers of this company,
(ii) the 95% confidence interval for the mean time taken to travel to work by the workers of this company.
Before seeing these results the managing director believed that the mean time was 26 minutes.
Explain whether your answers to part (b) support her belief.
Markscheme
A1
[1 mark]
(M1)A1
Note: Accept all answers that round to 28.9 and 189.
Note: Award M0 if division by 75.
[2 marks]
attempting to find a confidence interval. (M1)
(i) 90% interval: (26.2, 31.5) A1
(ii) 95% interval: (25.7, 32.0) A1
Note: Accept any values which round to within 0.1 of the correct value.
Note: Award M1A1A0 if only confidence limits are given in the form 28.9 ± 2.6.
[3 marks]
26 lies within the 95% interval but not within the 90% interval R1
Note: Award R1 for considering whether or not one or two of the intervals contain 26.
the belief is supported at the 5% level (accept 95%) A1
the belief is not supported at the 10% level (accept 90%) A1
Note: FT their intervals but award R1A1A0 if both intervals give the same conclusion.
[3 marks]
Examiners report
Anne is a farmer who grows and sells pumpkins. Interested in the weights of pumpkins produced, she records the weights of eight pumpkins and obtains the following results in kilograms.
Assume that these weights form a random sample from a distribution.
Anne claims that the mean pumpkin weight is 7.5 kilograms. In order to test this claim, she sets up the null hypothesis .
Determine unbiased estimates for and .
Use a two-tailed test to determine the p-value for the above results.
Interpret your p-value at the 5% level of significance, justifying your conclusion.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
UE of is A1
UE of is 0.404 (M1)A1
Note: Accept answers that round correctly to 2 sf.
Note: Condone incorrect notation, ie, instead of UE of and instead of UE of .
Note: M0 for squaring giving 0.354, M1A0 for failing to square
[3 marks]
attempting to use the t-test (M1)
p-value is 0.0566 A2
Note: Accept any answer that rounds correctly to 2 sf.
[3 marks]
R1
we accept the null hypothesis (mean pumpkin weight is 7.5 kg) A1
Note: Apply follow through on the candidate’s p-value.
Note: Do not award A1 if R1 is not awarded.
[2 marks]
Examiners report
A shop sells carrots and broccoli. The weights of carrots can be modelled by a normal distribution with variance and the weights of broccoli can be modelled by a normal distribution with variance . The shopkeeper claims that the mean weight of carrots is and the mean weight of broccoli is .
Dong Wook decides to investigate the shopkeeper’s claim that the mean weight of carrots is . He plans to take a random sample of carrots in order to calculate a confidence interval for the population mean weight.
Anjali thinks the mean weight, , of the broccoli is less than . She decides to perform a hypothesis test, using a random sample of size . Her hypotheses are
.
She decides to reject if the sample mean is less than .
Assuming that the shopkeeper’s claim is correct, find the probability that the weight of six randomly chosen carrots is more than two times the weight of one randomly chosen broccoli.
Find the least value of required to ensure that the width of the confidence interval is less than .
Find the significance level for this test.
Given that the weights of the broccoli actually follow a normal distribution with mean and variance , find the probability of Anjali making a Type II error.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
Let M1
(M1)(A1)
(M1)(A1)
A1
Note: Condone the notation only if the (M1) is awarded for the variance.
[6 marks]
(A1)
M1
A1
Note: Condone the use of equal signs.
[3 marks]
variance (A1)
under
significance level (M1)
or A1
Note: Accept any answer that rounds to or .
[3 marks]
Type II error probability (M1)
(A1)
A1
Note: Accept any answer that rounds to .
[3 marks]
Examiners report
Two independent random variables and follow Poisson distributions.
Given that and , calculate
.
Var.
.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
(M1)A1
[2 marks]
Var and Var (R1)
(M1)
= 84 A1
[3 marks]
use of (M1)
; A1
(M1)
= −8 A1
[4 marks]
Examiners report
This question is about modelling the spread of a computer virus to predict the number of computers in a city which will be infected by the virus.
A systems analyst defines the following variables in a model:
- is the number of days since the first computer was infected by the virus.
- is the total number of computers that have been infected up to and including day .
The following data were collected:
A model for the early stage of the spread of the computer virus suggests that
where is the total number of computers in a city and is a measure of how easily the virus is spreading between computers. Both and are assumed to be constant.
The data above are taken from city X which is estimated to have million computers.
The analyst looks at data for another city, Y. These data indicate a value of .
An estimate for , can be found by using the formula:
.
The following table shows estimates of for city X at different values of .
An improved model for , which is valid for large values of , is the logistic differential equation
where and are constants.
Based on this differential equation, the graph of against is predicted to be a straight line.
Find the equation of the regression line of on .
Write down the value of , Pearson’s product-moment correlation coefficient.
Explain why it would not be appropriate to conduct a hypothesis test on the value of found in (a)(ii).
Find the general solution of the differential equation .
Using the data in the table write down the equation for an appropriate non-linear regression model.
Write down the value of for this model.
Hence comment on the suitability of the model from (b)(ii) in comparison with the linear model found in part (a).
By considering large values of write down one criticism of the model found in (b)(ii).
Use your answer from part (b)(ii) to estimate the time taken for the number of infected computers to double.
Find in which city, X or Y, the computer virus is spreading more easily. Justify your answer using your results from part (b).
Determine the value of and of . Give your answers correct to one decimal place.
Use linear regression to estimate the value of and of .
The solution to the differential equation is given by
where is a constant.
Using your answer to part (f)(i), estimate the percentage of computers in city X that are expected to have been infected by the virus over a long period of time.
Markscheme
A1A1
Note: Award at most A1A0 if answer is not an equation. Award A1A0 for an answer including either or .
[2 marks]
A1
[1 mark]
is not a random variable OR it is not a (bivariate) normal distribution
OR data is not a sample from a population
OR data appears nonlinear
OR only measures linear correlation R1
Note: Do not accept “ is not large enough”.
[1 mark]
attempt to separate variables (M1)
A1A1A1
Note: Award A1 for LHS, A1 for , and A1 for .
Award full marks for OR .
Award M1A1A1A0 for
[4 marks]
attempt at exponential regression (M1)
A1
OR
attempt at exponential regression (M1)
A1
Note: Condone answers involving or . Condone absence of “” Award M1A0 for an incorrect answer in correct format.
[2 marks]
A1
[1 mark]
comparing something to do with and something to do with M1
Note: Examples of where the M1 should be awarded:
The “correlation coefficient” in the exponential model is larger.
Model B has a larger
Examples of where the M1 should not be awarded:
The exponential model shows better correlation (since not clear how it is being measured)
Model 2 has a better fit
Model 2 is more correlated
an unambiguous comparison between and or and leading to the conclusion that the model in part (b) is more suitable / better A1
Note: Condone candidates claiming that is the “correlation coefficient” for the non-linear model.
[2 marks]
it suggests that there will be more infected computers than the entire population R1
Note: Accept any response that recognizes unlimited growth.
[1 mark]
OR OR OR using the model to find two specific times with values of which double M1
(days) A1
Note: Do not FT from a model which is not exponential. Award M0A0 for an answer of which comes from using from the data or any other answer which finds a doubling time from figures given in the table.
[2 marks]
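The doubling-time mark above rests on the standard result that an exponential model of the form N = a·e^(kt) doubles every ln 2 / k time units, whatever the starting time. A minimal Python sketch of that result follows; the parameter values `a` and `k` are hypothetical and are not the values from this question.

```python
import math

def doubling_time(k):
    """Time for N = a * exp(k * t) to double; it does not depend on a or t."""
    return math.log(2) / k

def model(t, a, k):
    """Exponential growth model N(t) = a * exp(k * t)."""
    return a * math.exp(k * t)

# Hypothetical parameters for illustration only (not taken from the question).
a, k = 50.0, 0.2
T = doubling_time(k)

# Check the doubling property on the model curve itself, not on raw data points.
ratio = model(3.0 + T, a, k) / model(3.0, a, k)
print(round(T, 3), round(ratio, 6))
```

The check is made on the model curve rather than on tabulated data, which is the distinction the markscheme note draws.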
an attempt to calculate for city X (M1)
OR
A1
this is larger than so the virus spreads more easily in city X R1
Note: It is possible to award M1A0R1.
Condone “so the virus spreads faster in city X” for the final R1.
[3 marks]
A1A1
Note: Award A1A0 if values are correct but not to dp.
[2 marks]
(A1)(A1)
Note: Award A1 for each coefficient seen – not necessarily in the equation. Do not penalize seeing in the context of and .
identifying that the constant is OR that the gradient is (M1)
therefore A1
A1
Note: Accept a value of of from use of sf value of , or any other value from plausible pre-rounding.
Allow follow-through within the question part, from the equation of their line to the final two A1 marks.
[5 marks]
recognizing that their is the eventual number of infected (M1)
A1
Note: Accept any final answer consistent with their answer to part (f)(i) unless their is less than in which case award at most M1A0.
[2 marks]
Examiners report
A significant minority were unable to attempt 1(a) which suggests poor preparation for the use of the GDC in this statistics-heavy course. Large numbers of candidates appeared to use and interchangeably. Accurate use of notation is an important skill which needs to be developed.
1(a)(iii) was a question at the heart of the Applications and Interpretation course. In modern statistics many of the calculations are done by a computer so the skill of the modern statistician lies in knowing which tests are appropriate and how to interpret the results. Very few candidates seemed familiar with the assumptions required for the use of the standard test on the correlation coefficient. Indeed, many candidates answered this by claiming that the value was either too large or too small to do a hypothesis test, indicating a major misunderstanding of the purpose of hypothesis tests.
1(b)(i) was done very poorly. It seems that perhaps adding parameters to the equation confused many candidates – if the equation had been many more would have successfully attempted this. However, the presence of parameters is a fundamental part of mathematical modelling so candidates should practise working with expressions involving them.
1(b)(ii) and (iii) were done relatively well, with many candidates using the data to recognize an exponential model was a good idea. Part (iv) was often communicated poorly. Many candidates might have done the right thing in their heads but just writing that the correlation was better did not show which figures were being compared. Many candidates who did write down the numbers made it clear that they were comparing an value with an value.
1(c) was not meant to be such a hard question. There is a standard formula for half-life which candidates were expected to adapt. However, large numbers of candidates conflated the data and the model, finding the time for one of the data points (which did not lie on the model curve) to double. Candidates also thought that the value of t found was equivalent to the doubling time, often giving answers of around 40 days which should have been obviously wrong.
1(d) was quite tough. Several candidates realized that was the required quantity to be compared but very few could calculate for city X using the given information.
1(e) was meant to be relatively straightforward but many candidates were unable to interpret the notation given to do the quite straightforward calculation.
1(f) was meant to be a more unusual problem-solving question getting candidates to think about ways of linearizing a non-linear problem. This proved too much for nearly all candidates.
In a large population of hens, the weight of a hen is normally distributed with mean μ kg and standard deviation σ kg. A random sample of 100 hens is taken from the population.
The mean weight for the sample is denoted by X̄.
The sample values are summarized by and where kg is the weight of a hen.
It is found that σ = 0.27. It is decided to test, at the 1 % level of significance, the null hypothesis μ = 1.95 against the alternative hypothesis μ > 1.95.
State the distribution of X̄ giving its mean and variance.
Find an unbiased estimate for μ.
Find an unbiased estimate for σ².
Find a 90 % confidence interval for μ.
Find the p-value for the test.
Write down the conclusion reached.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
A1
Note: Accept in place of 100.
[1 mark]
A1
Note: Accept 2.00, 2.0 and 2.
[1 mark]
(M1)
= 0.086864
unbiased estimate for σ² is 0.0869 A1
Note: Accept any answer which rounds to 0.087.
[2 marks]
90 % confidence interval is (M1)
= (1.95, 2.05) A1A1
Note: FT their from (c).
Note: Condone the use of the z-value 1.645 since n is large.
Note: Accept any values that round to 1.95 and 2.05.
[3 marks]
p-value is 0.0377 A2
Note: Award A1 for the 2-tail value 0.0754.
Note: Award A2 for 0.0377 and A1 for any other value that rounds to 0.038.
Note: FT their estimated mean from (b); note that 2 gives p = 0.032(0).
[2 marks]
accept the null hypothesis A1
Note: FT their p-value.
[1 mark]
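The one-tailed test in this markscheme can be checked numerically. The sketch below uses the known standard deviation 0.27, the sample size 100 and the null value 1.95 from the question. The sample mean 1.998 is an assumption, since the sample totals are not legible in the text above; it was chosen because it reproduces the quoted p-value. The second call uses a mean of 2, the follow-through case mentioned in the note.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_tail_p_value(sample_mean, mu0, sigma, n):
    """One-tailed p-value for H0: mu = mu0 against H1: mu > mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return 1.0 - normal_cdf(z)

n, sigma, mu0 = 100, 0.27, 1.95   # values stated in the question
assumed_mean = 1.998              # ASSUMPTION: not visible in the text above

p_assumed = upper_tail_p_value(assumed_mean, mu0, sigma, n)
p_rounded = upper_tail_p_value(2.0, mu0, sigma, n)  # follow-through with mean 2
print(round(p_assumed, 4), round(p_rounded, 4))
```

Under these assumptions both p-values exceed 0.01, which is consistent with the conclusion "accept the null hypothesis" at the 1 % level.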
Examiners report
John rings a church bell 120 times. The time interval, , between two successive rings is a random variable with mean of 2 seconds and variance of .
Each time interval, , is independent of the other time intervals. Let be the total time between the first ring and the last ring.
The church vicar subsequently becomes suspicious that John has stopped coming to ring the bell and that he is letting his friend Ray do it. When Ray rings the bell the time interval, has a mean of 2 seconds and variance of .
The church vicar makes the following hypotheses:
: Ray is ringing the bell; : John is ringing the bell.
He records four values of . He decides on the following decision rule:
If for all four values of he accepts , otherwise he accepts .
Find
(i) ;
(ii) .
Explain why a normal distribution can be used to give an approximate model for .
Use this model to find the values of and such that , where and are symmetrical about the mean of .
Calculate the probability that he makes a Type II error.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
(i) A1
(ii) (M1)A1
Note: If 120 is used instead of 119 award A0(M1)A0 for part (a) and apply follow through for parts (b)-(d). (b) is unaffected and in (c) the interval becomes . In (d) the first 2 A1 marks are for and so the final answer will round to 0.017.
[3 marks]
justified by the Central Limit Theorem R1
since is large A1
Note: Accept .
[2 marks]
(M1)(A1)
(A1)
so (R1)
(M1)
interval is A1A1
Notes: Accept the use of inverse normal applied to the distribution of .
Alternative is to use the GDC to find a pretend confidence interval for a mean and then convert by multiplying by 119.
Either or correct implies the five implied marks.
Accept any numbers that round to these 3sf numbers.
[7 marks]
under (M1)
(A1)
probability that all 4 values of lie in this interval is
(M1)(A1)
so probability of a Type II error is 0.0304 (3sf) A1
Note: Accept any answer that rounds to 0.030.
[5 marks]
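The Type II error calculation above can be sketched numerically: find the probability that one total time falls in the acceptance interval when the alternative hypothesis is true, then raise it to the fourth power for four independent recordings. The variance under the alternative hypothesis and the end points of the vicar's interval are not legible in the text above, so the values used here (variance 1/9 and the interval 236 to 240) are assumptions chosen for illustration; with them the result rounds to the quoted 0.030.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n_intervals = 119                 # 120 rings give 119 intervals
mean_interval = 2.0               # seconds, from the question
assumed_variance = 1.0 / 9.0      # ASSUMPTION: value missing from the text
assumed_low, assumed_high = 236.0, 240.0  # ASSUMPTION: decision interval

# Central Limit Theorem: the total time is approximately normal.
mean_total = n_intervals * mean_interval
sd_total = math.sqrt(n_intervals * assumed_variance)

# Probability that a single total lies inside the acceptance interval.
p_one = (normal_cdf((assumed_high - mean_total) / sd_total)
         - normal_cdf((assumed_low - mean_total) / sd_total))

# Type II error: all four independent totals lie inside the interval.
p_type_2 = p_one ** 4
print(round(p_one, 4), round(p_type_2, 4))
```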
Examiners report
An estate manager is responsible for stocking a small lake with fish. He begins by introducing fish into the lake and monitors their population growth to determine the likely carrying capacity of the lake.
After one year an accurate assessment of the number of fish in the lake is taken and it is found to be .
Let be the number of fish years after the fish have been introduced to the lake.
Initially it is assumed that the rate of increase of will be constant.
When the estate manager again decides to estimate the number of fish in the lake. To do this he first catches fish and marks them, so they can be recognized if caught again. These fish are then released back into the lake. A few days later he catches another fish, releasing each fish after it has been checked, and finds of them are marked.
Let be the number of marked fish caught in the second sample, where is considered to be distributed as . Assume the number of fish in the lake is .
The estate manager decides that he needs bounds for the total number of fish in the lake.
The estate manager feels confident that the proportion of marked fish in the lake will be within standard deviations of the proportion of marked fish in the sample and decides these will form the upper and lower bounds of his estimate.
The estate manager now believes the population of fish will follow the logistic model where is the carrying capacity and .
The estate manager would like to know if the population of fish in the lake will eventually reach .
Use this model to predict the number of fish in the lake when .
Assuming the proportion of marked fish in the second sample is equal to the proportion of marked fish in the lake, show that the estate manager will estimate there are now fish in the lake.
Write down the value of and the value of .
State an assumption that is being made for to be considered as following a binomial distribution.
Show that an estimate for is .
Hence show that the variance of the proportion of marked fish in the sample, , is .
Taking the value for the variance given in (d) (ii) as a good approximation for the true variance, find the upper and lower bounds for the proportion of marked fish in the lake.
Hence find upper and lower bounds for the number of fish in the lake when .
Given this result, comment on the validity of the linear model used in part (a).
Assuming a carrying capacity of use the given values of and to calculate the parameters and .
Use these parameters to calculate the value of predicted by this model.
Comment on the likelihood of the fish population reaching .
Markscheme
* This sample question was produced by experienced DP mathematics senior examiners to aid teachers in preparing for external assessment in the new MAA course. There may be minor differences in formatting compared to formal exam papers.
M1
A1
[2 marks]
M1A1
AG
[2 marks]
A1A1
[2 marks]
Any valid reason for example: R1
Marked fish are randomly distributed, so constant.
Each fish caught is independent of previous fish caught
[1 mark]
M1
A1
AG
[2 marks]
M1A1
AG
[2 marks]
(M1)
and A1
[2 marks]
M1
Lower bound upper bound A1
[2 marks]
Linear model prediction falls outside this range so unlikely to be a good model R1A1
[2 marks]
M1
A1
M1
(M1)
A1
[5 marks]
M1A1
Note: Accept any answer that rounds to .
[2 marks]
This is much higher than the calculated upper bound for so the rate of growth of the fish is unlikely to be sufficient to reach a carrying capacity of . M1R1
[2 marks]
Examiners report
Peter, the Principal of a college, believes that there is an association between the score in a Mathematics test, , and the time taken to run 500 m, seconds, of his students. The following paired data are collected.
It can be assumed that follow a bivariate normal distribution with product moment correlation coefficient .
State suitable hypotheses and to test Peter’s claim, using a two-tailed test.
Carry out a suitable test at the 5 % significance level. With reference to the p-value, state your conclusion in the context of Peter’s claim.
Peter uses the regression line of on as and calculates that a student with a Mathematics test score of 73 will have a running time of 101 seconds. Comment on the validity of his calculation.
Markscheme
A1
Note: It must be .
[1 mark]
A2
Note: Accept anything that rounds to 0.65.
0.649 > 0.05 R1
hence, we accept H0 and conclude that Peter’s claim is wrong A1
Note: The A mark depends on the R mark and the answer must be given in context. Follow through the p-value in part (b).
[4 marks]
a statement along the lines of ‘(we have accepted that) the two variables are independent’ or ‘the two variables are weakly correlated’ R1
a statement along the lines of ‘the use of the regression line is invalid’ or ‘it would give an inaccurate result’ R1
Note: Award the second R1 only if the first R1 is awarded.
Note: FT the conclusion in (a)(ii). If a candidate concludes that the claim is correct, mark as follows: (as we have accepted H1) the 2 variables are dependent and 73 lies in the range of values R1, hence the use of the regression line is valid R1.
[2 marks]
Examiners report
Employees answer the telephone in a customer relations department. The time taken for an employee to deal with a customer is a random variable which can be modelled by a normal distribution with mean 150 seconds and standard deviation 45 seconds.
Find the probability that the time taken for a randomly chosen customer to be dealt with by an employee is greater than 180 seconds.
Find the probability that the time taken by an employee to deal with a queue of three customers is less than nine minutes.
At the start of the day, one employee, Amanda, has a queue of four customers. A second employee, Brian, has a queue of three customers. You may assume they work independently.
Find the probability that Amanda’s queue will be dealt with before Brian’s queue.
Markscheme
* This question is from an exam for a previous syllabus, and may contain minor differences in marking or structure.
Note: In question 2, accept answers that round correctly to 2 significant figures.
(M1)A1
[2 marks]
required to find
let
(A1)
(M1)
(A1)
A1
Note: In (b) and (c) condone incorrect notation, eg, for .
[4 marks]
let (M1)
(A1)
(M1)
= 14175 (A1)
required to find (M1)
= 0.104 A1
[6 marks]
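The three probabilities in this question follow from the fact that sums and differences of independent normal variables are normal, with means and variances that add. The sketch below is one way to check them using only the mean of 150 seconds and standard deviation of 45 seconds given in the question; the helper name `normal_cdf` is introduced here for illustration.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN, SD = 150.0, 45.0  # seconds, from the question

# (a) One customer takes longer than 180 seconds.
p_a = 1.0 - normal_cdf(180.0, MEAN, SD)

# (b) Three independent customers take less than nine minutes (540 s) in total:
#     the sum has mean 3 * 150 and variance 3 * 45^2.
p_b = normal_cdf(540.0, 3 * MEAN, math.sqrt(3) * SD)

# (c) D = (Amanda's four customers) - (Brian's three customers):
#     mean 600 - 450 = 150, variance (4 + 3) * 45^2 = 14175.
var_d = (4 + 3) * SD ** 2
p_c = normal_cdf(0.0, 4 * MEAN - 3 * MEAN, math.sqrt(var_d))

print(var_d, round(p_a, 3), round(p_b, 3), round(p_c, 3))
```

Variances add for a difference as well as for a sum, which is why the variance in part (c) is 7 × 45² = 14175 and not 1 × 45².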
Examiners report
This question compares possible designs for a new computer network between multiple school buildings, and whether they meet specific requirements.
A school’s administration team decides to install new fibre-optic internet cables underground. The school has eight buildings that need to be connected by these cables. A map of the school is shown below, with the internet access point of each building labelled .
Jonas is planning where to install the underground cables. He begins by determining the distances, in metres, between the underground access points in each of the buildings.
He finds , and .
The cost for installing the cable directly between and is .
Jonas estimates that it will cost per metre to install the cables between all the other buildings.
Jonas creates the following graph, , using the cost of installing the cables between two buildings as the weight of each edge.
The computer network could be designed such that each building is directly connected to at least one other building and hence all buildings are indirectly connected.
The computer network fails if any part of it becomes unreachable from any other part. To help protect the network from failing, every building could be connected to at least two other buildings. In this way if one connection breaks, the building is still part of the computer network. Jonas can achieve this by finding a Hamiltonian cycle within the graph.
After more research, Jonas decides to install the cables as shown in the diagram below.
Each individual cable is installed such that each end of the cable is connected to a building’s access point. The connection between each end of a cable and an access point has a probability of failing after a power surge.
For the network to be successful, each building in the network must be able to communicate with every other building in the network. In other words, there must be a path that connects any two buildings in the network. Jonas would like the network to have less than a probability of failing to operate after a power surge.
Find .
Find the cost per metre of installing this cable.
State why the cost for installing the cable between and would be higher than between the other buildings.
By using Kruskal’s algorithm, find the minimum spanning tree for , showing clearly the order in which edges are added.
Hence find the minimum installation cost for the cables that would allow all the buildings to be part of the computer network.
State why a path that forms a Hamiltonian cycle does not always form an Eulerian circuit.
Starting at , use the nearest neighbour algorithm to find the upper bound for the installation cost of a computer network in the form of a Hamiltonian cycle.
Note: Although the graph is not complete, in this instance it is not necessary to form a table of least distances.
By deleting , use the deleted vertex algorithm to find the lower bound for the installation cost of the cycle.
Show that Jonas’s network satisfies the requirement of there being less than a probability of the network failing after a power surge.
Markscheme
(M1)(A1)
Note: Award (M1) for substitution into the cosine rule and (A1) for correct substitution.
A1
[3 marks]
(M1)
A1
[2 marks]
any reasonable statement referring to the lake R1
(eg. there is a lake between and , the cables would need to be installed under/over/around the lake, special waterproof cables are needed for the lake, etc.)
[1 mark]
edges (or weights) are chosen in the order
A1A1A1
Note: Award A1 for the first two edges chosen in the correct order. Award A1A1 for the first six edges chosen in the correct order. Award A1A1A1 for all seven edges chosen in the correct order. Accept a diagram as an answer, provided the order of edges is communicated.
[3 marks]
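Kruskal's algorithm, as required above, sorts the edges by weight and adds each edge in turn unless it would create a cycle. The Python sketch below records the order in which edges are added, which is what the markscheme rewards. The five-vertex graph and its weights are hypothetical and are not Jonas's graph, whose weights are not legible in the text above.

```python
def kruskal(vertices, edges):
    """Return (tree_edges_in_order_added, total_weight).

    edges is an iterable of (weight, u, v) tuples.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, total = [], 0
    for weight, u, v in sorted(edges):     # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # edge joins two components: no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
            total += weight
    return tree, total

# Hypothetical graph for illustration only (not the graph in the question).
vertices = ["P", "Q", "R", "S", "T"]
edges = [(4, "P", "Q"), (2, "P", "R"), (5, "Q", "R"), (7, "Q", "S"),
         (3, "R", "S"), (6, "S", "T"), (8, "R", "T")]

tree, total = kruskal(vertices, edges)
for u, v, w in tree:
    print(u, v, w)
print("total weight:", total)
```

A spanning tree on n vertices always has n − 1 edges, so the loop adds four edges here; for the eight buildings in the question it would add seven, matching the seven edges mentioned in the marking note.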
Finding the sum of the weights of their edges (M1)
total cost A1
[2 marks]
a Hamiltonian cycle is not always an Eulerian circuit as it does not have to include all edges of the graph (only all vertices) R1
[1 mark]
edges (or weights) are chosen in the order
A1A1A1
Note: Award A1 for the first two edges chosen in the correct order. Award A1A1 for the first five edges chosen in the correct order. Award A1A1A1 for all eight edges chosen in the correct order. Accept a diagram as an answer, provided the order of edges is communicated.
finding the sum of the weights of their edges (M1)
upper bound A1
[5 marks]
attempt to find MST after deleting vertex D (M1)
these edges (or weights) (in any order)
A1
Note: Prim’s or Kruskal’s algorithm could be used at this stage.
reconnect to MST with two different edges (M1)
A1
Note: This A1 is independent of the first A mark and can be awarded if both and are chosen to reconnect to the MST, even if the MST is incorrect.
finding the sum of the weights of their edges (M1)
Note: For candidates with an incorrect MST or no MST, the weights of at least seven of the edges being summed (two of which must connect to ) must be shown to award this (M1).
lower bound A1
[6 marks]
METHOD 1
recognition of a binomial distribution (M1)
finding the probability that a cable fails (at least one of its connections fails)
OR A1
recognition that two cables must fail for the network to go offline M1
recognition of binomial distribution for network, (M1)
OR A1
therefore, the diagram satisfies the requirement since AG
Note: Evidence of binomial distribution may be seen as combinations.
METHOD 2
recognition of a binomial distribution (M1)
finding the probability that at least two connections fail
OR A1
recognition that the previous answer is an overestimate M1
finding probability of two ends of the same cable failing, ,
and the ends of the other cables not failing,
(A1)
A1
therefore, the diagram satisfies the requirement since AG
METHOD 3
recognition of a binomial distribution M1
finding the probability that the network remains secure if or connections fail or if connections fail provided that the second failed connection occurs at the other end of the cable with the first failure (M1)
(remains secure) A1
A1
(network fails) A1
therefore, the diagram satisfies the requirement since AG
METHOD 4
(network failing)
M1
A1A1A1
Note: Award A1 for each of 2nd, 3rd and last terms.
A1
therefore, the diagram satisfies the requirement since AG
[5 marks]
Examiners report
This question part was intended to be an easy introduction to help candidates begin working with the larger story and most candidates handled it well. However, it was surprisingly common for a candidate to correctly choose the cosine rule and to make the correct substitutions into the formula but then arrive at an incorrect answer. The frequency of this mistake suggests that candidates were either making simple entry mistakes into their GDC or forgetting to ensure that their GDC was set to degrees rather than radians.
(b) and (c) Most candidates were able to gain the three marks available.
(d), (f) and (g) These three question parts required candidates to demonstrate their ability to carry out graph theory algorithms. Kruskal’s algorithm was split into two different question parts to guide candidates to show their work. As a result, many were able to score well in part (d)(ii) either from having the correct MST or from “follow through” marks from an incorrect MST. However, without this guidance in 2(f) and 2(g), many candidates did a poor job of showing the process they were using to apply the algorithms. The candidates that scored well were detailed in showing the order of how edges were selected and how they were being summed to arrive at the final answers. Although “follow through” within the problem was not available for the final answer in parts 2(f) and 2(g), many candidates missed the opportunity to gain the final method mark in both parts by not fully showing the process they used.
Many candidates were able to state the definitions of Hamiltonian cycles and Eulerian circuits. However, the question was not asking for definitions but rather a distinct conclusion of why a Hamiltonian cycle is not always an Eulerian circuit. Disappointingly, many incorrect answers contained five or more lines of writing that may have used up exam time that could have been devoted to other question parts. Another common mistake seen here was candidates incorrectly trying to state a reason based on the number of odd vertices.
This question was very challenging for almost all the candidates. Although there were several different methods that candidates could have used to answer this question, most candidates were only able to gain one or two marks here. Many candidates did recognize that something binomial was needed, but few knew how to set up the correct parameters for the distribution.